 smaller model



MatFormer: Nested Transformer for Elastic Inference

Neural Information Processing Systems

Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, the substantial costs associated with training these models often limit the number of unique model sizes that can be offered. Consequently, practitioners are compelled to select a model that may not be optimally aligned with their specific latency and cost requirements. We present MatFormer, a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints. MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model.
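
The nesting idea can be illustrated with a small sketch. The abstract does not say how the FFN blocks are nested, so this assumes that each smaller sub-model reuses the first `width` hidden units of the full FFN. The class name `NestedFFN`, its parameters, and the random initialization are hypothetical and are not taken from the paper's code.

```python
import math
import random


def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


class NestedFFN:
    """Toy FFN whose smaller variants share a prefix of the hidden units (assumption)."""

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        self.d_model = d_model
        self.d_hidden = d_hidden
        # w_in[j] projects the input onto hidden unit j;
        # w_out[j] projects hidden unit j back to the model dimension.
        self.w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
        self.w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]

    def forward(self, x, width=None):
        # width selects how many leading hidden units are used; None means the full block
        width = self.d_hidden if width is None else width
        if not 1 <= width <= self.d_hidden:
            raise ValueError("width must be between 1 and d_hidden")
        out = [0.0] * self.d_model
        for j in range(width):
            h = gelu(sum(w * xi for w, xi in zip(self.w_in[j], x)))
            for i in range(self.d_model):
                out[i] += h * self.w_out[j][i]
        return out


ffn = NestedFFN(d_model=4, d_hidden=8)
x = [1.0, 0.5, -0.5, 2.0]
full_output = ffn.forward(x)           # uses all 8 hidden units
half_output = ffn.forward(x, width=4)  # same weights, first 4 hidden units only
```

Under this assumption a single set of weights serves every width, so picking a smaller sub-model at inference time needs no separate checkpoint.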


ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation

Wang, Xiao, Larionov, Daniil, Wu, Siwei, Liu, Yiqi, Eger, Steffen, Moosavi, Nafise Sadat, Lin, Chenghua

arXiv.org Artificial Intelligence

Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.


fb2697869f56484404c8ceee2985b01d-AuthorFeedback.pdf

Neural Information Processing Systems

"blur the distributions": As Wasserstein barycenter adjusts the support, blurring is more likely for Euclidean Vanilla averaging, in contrast, fails to fine-tune despite trying numerous settings of optimization hyperparameters. Also, Fig. 1 shows similar gains for data-free post-processing in case of structured pruning (as in Sec 5.2). Vanilla average fails to retrain. Results shown are mean ± std. "there could possibly be more competent baselines": The 'constraint' of performing this without sharing of sensitive training data arises in many applications, "improvement over vanilla averaging is very marginal": We respectfully disagree.


LLM Optimization Unlocks Real-Time Pairwise Reranking

Wu, Jingyu, Shrivastava, Aditya, Zhu, Jing, Samuel, Alfy, Kumar, Anoop, Liu, Daben

arXiv.org Artificial Intelligence

Efficiently reranking documents retrieved from information retrieval (IR) pipelines to enhance the overall quality of Retrieval-Augmented Generation (RAG) systems remains an important yet challenging problem. Recent studies have highlighted the importance of Large Language Models (LLMs) in reranking tasks. In particular, Pairwise Reranking Prompting (PRP) has emerged as a promising plug-and-play approach due to its usability and effectiveness. However, the inherent complexity of the algorithm, coupled with the high computational demands and latency incurred by LLMs, raises concerns about its feasibility in real-time applications. To address these challenges, this paper presents a focused study on pairwise reranking, demonstrating that carefully applied optimization methods can significantly mitigate these issues. By implementing these methods, we achieve a latency reduction of up to 166 times, from 61.36 seconds to 0.37 seconds per query, with an insignificant drop in performance measured by Recall@k. Our study highlights the importance of design choices that were previously overlooked, such as using smaller models, limiting the reranked set, using lower precision, reducing positional bias with one-directional order inference, and restricting output tokens. These optimizations make LLM-based reranking substantially more efficient and feasible for latency-sensitive, real-world deployments.
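
Two of the design choices listed above, limiting the reranked set and asking about each pair in one order only, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `prefer_first` is a hypothetical word-overlap stand-in for the LLM call, and `pairwise_rerank` and `top_k` are names invented for the example.

```python
from functools import cmp_to_key


def prefer_first(query, doc_a, doc_b):
    """Hypothetical stand-in for one LLM call: is doc_a at least as relevant as doc_b?"""
    query_words = set(query.lower().split())
    overlap_a = len(query_words & set(doc_a.lower().split()))
    overlap_b = len(query_words & set(doc_b.lower().split()))
    return overlap_a >= overlap_b


def pairwise_rerank(query, docs, top_k=4, compare=prefer_first):
    # Limit the reranked set: only the first top_k retrieved documents are compared.
    head, tail = docs[:top_k], docs[top_k:]

    def cmp(a, b):
        # One-directional inference: ask about (a, b) once, never the swapped (b, a).
        return -1 if compare(query, a, b) else 1

    # Documents beyond top_k keep their original retrieval order.
    return sorted(head, key=cmp_to_key(cmp)) + tail


retrieved = [
    "cooking pasta at home",
    "small language model latency tricks",
    "language model basics",
    "latency",
    "small language model latency again",
]
reranked = pairwise_rerank("small language model latency", retrieved, top_k=4)
```

With `top_k=4` the fifth document is never compared, which bounds the number of comparator calls regardless of how many documents the retriever returns.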


Enhancing Reasoning Abilities of Small LLMs with Cognitive Alignment

Cai, Wenrui, Wang, Chengyu, Yan, Junbing, Huang, Jun, Fang, Xiangzhong

arXiv.org Artificial Intelligence

The reasoning capabilities of large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need for training effective small reasoning models. A critical challenge is that small models possess different reasoning capacities and cognitive trajectories compared with their larger counterparts. Hence, directly distilling chain-of-thought (CoT) rationales from large LRMs to smaller ones can sometimes be ineffective and often requires a substantial amount of annotated data. In this paper, we first introduce a novel Critique-Rethink-Verify (CRV) system, designed for training smaller yet powerful LRMs. Our CRV system consists of multiple LLM agents, each specializing in unique tasks: (i) critiquing the CoT rationales according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. Building on the CRV system, we further propose the Cognitive Preference Optimization (CogPO) algorithm to continuously enhance the reasoning abilities of smaller models by aligning their reasoning processes with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of our CRV+CogPO framework, which outperforms other methods by a large margin.
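
The three agent roles described above can be read as a simple pipeline. The sketch below assumes a single critique, rethink, verify pass per rationale; the abstract does not specify the control flow, and the function `crv_refine` and the toy agents are hypothetical placeholders for LLM calls.

```python
def crv_refine(problem, cot, critic, rethinker, verifier):
    """Run one critique -> rethink -> verify pass over a chain-of-thought rationale.

    Returns the refined rationale, or None when verification fails (assumed behavior).
    """
    critique = critic(problem, cot)               # (i) critique the rationale
    refined = rethinker(problem, cot, critique)   # (ii) refine it using the critique
    return refined if verifier(problem, refined) else None  # (iii) keep only verified output


# Toy agents standing in for LLM calls (hypothetical).
def toy_critic(problem, cot):
    return "too terse for a small model: show the intermediate sum"


def toy_rethinker(problem, cot, critique):
    return cot + " (2 + 3 = 5, then 5 * 4 = 20)"


def toy_verifier(problem, cot):
    return "20" in cot


refined = crv_refine("(2 + 3) * 4", "Answer: 20", toy_critic, toy_rethinker, toy_verifier)
```

Passing the agents in as callables keeps the pipeline independent of any particular model or prompt format.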


Agent-based Automated Claim Matching with Instruction-following LLMs

Pisarevskaya, Dina, Zubiaga, Arkaitz

arXiv.org Artificial Intelligence

We present a novel agent-based approach for the automated claim matching task with instruction-following LLMs. We propose a two-step pipeline that first generates prompts with LLMs and then performs claim matching as a binary classification task with LLMs. We demonstrate that LLM-generated prompts can outperform SOTA with human-generated prompts, and that smaller LLMs can do as well as larger ones in the generation process, which saves computational resources. We also demonstrate the effectiveness of using different LLMs for each step of the pipeline, i.e., using one LLM for prompt generation and another for claim matching. Our investigation into the prompt generation process in turn reveals insights into the LLMs' understanding of claim matching.


Reasoning Distillation and Structural Alignment for Improved Code Generation

Jalilifard, Amir, Rocha, Anderson de Rezende, Raimundo, Marcos Medeiros

arXiv.org Artificial Intelligence

Effective code generation with language models hinges on two critical factors: accurately understanding the intent of the prompt and generating code that applies algorithmic reasoning to produce correct solutions capable of passing diverse test cases while adhering to the syntax of the target programming language. Unlike other language tasks, code generation requires more than accurate token prediction; it demands comprehension of solution-level and structural relationships rather than merely generating the most likely tokens. Very large language models (VLLMs) are capable of generating detailed steps toward the correct solution of complex tasks where reasoning is crucial in solving the problem. Such reasoning capabilities may be absent in smaller language models. Therefore, in this work, we distill the reasoning capabilities of a VLLM into a smaller, more efficient model that is faster and cheaper to deploy. Our approach trains the model to emulate the reasoning and problem-solving abilities of the VLLM by learning to identify correct solution pathways and establishing a structural correspondence between problem definitions and potential solutions through a novel method of structure-aware loss optimization. This enables the model to transcend token-level generation and to deeply grasp the overarching structure of solutions for given problems. Experimental results show that our fine-tuned model, developed through a cheap and simple-to-implement process, significantly outperforms our baseline model in terms of pass@1, average data flow, and average syntax match metrics across the MBPP, MBPP Plus, and HumanEval benchmarks.
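
For readers unfamiliar with the pass@1 metric mentioned above, the sketch below shows the unbiased pass@k estimator commonly used with code benchmarks such as HumanEval. The abstract does not state how the authors computed pass@1, so treating this estimator as their method is an assumption; the function name `pass_at_k` is chosen for the example.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes the tests.

    n: number of samples generated for a problem
    c: number of those samples that pass all tests
    k: sample budget being evaluated (k = 1 gives pass@1)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the fraction of passing samples, c / n.
score = pass_at_k(n=10, c=3, k=1)
```

A benchmark-level score is the mean of this per-problem value over all problems.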


There's a simple way we could drastically cut AI energy use

New Scientist

Being more judicious in which AI models we use for tasks could potentially save 31.9 terawatt-hours of energy this year alone - equivalent to the output of five nuclear reactors. Tiago da Silva Barros at the University of Cote d'Azur in France and his colleagues looked at 14 different tasks that people use generative AI tools for, ranging from text generation to speech recognition and image classification. They then examined public leaderboards, including those hosted by the machine learning hub Hugging Face, for how different models perform. The energy efficiency of the models during inference - when an AI model produces an answer - was measured by a tool called CarbonTracker, and the total energy use of that model was calculated by tracking user downloads. "Based on the size of the model, we estimated the energy consumption, and based on this, we can try to do our estimations," says da Silva Barros.